The Turing Test in the Classroom 


Lisa Torrey, Karen Johnson, Sid Sondergard, Pedro Ponce, Laura Desmond 
St. Lawrence University 


Abstract 


This paper discusses the Turing Test as an educational activ- 
ity for undergraduate students. It describes in detail an exper- 
iment that we conducted in a first-year non-CS course. We 
also suggest other pedagogical purposes that the Turing Test 
could serve. 


Introduction 


The Turing Test occasionally appears as a hands-on activ- 
ity in K-12 classrooms (Lehman 2009) and museum ex- 
hibits (Hollister et al. 2013). However, the 2013 ACM cur- 
riculum suggests only that computer science majors should 
be able to describe it (The Joint Task Force on Computing 
Curricula 2013). 

We propose a few more active roles the Turing Test could 
play in the undergraduate classroom, both within and be- 
yond the CS major. For example, it could serve as: 


e A broad-interest introduction to computer science in a 
first-year course. 


e An exercise in experimental design and analysis in a psy- 
chology or statistics course. 


e A hands-on social activity in a course on philosophy or 
artificial intelligence. 


e A programming project for advanced CS students. 


We begin by motivating each of these examples. Then we 
elaborate on the first one by presenting a Turing Test that we 
conducted with a large group of first-year undergraduates. 
We describe how we organized the experiment, the software 
we designed for it, and the data it generated. Finally, we 
share some observations and advice that may help other in- 
structors plan successful Turing Tests. 


History of the Turing Test 


In 1950, Alan Turing asked his famous question: “Can ma- 
chines think?” To provide a way to answer it, he introduced 
the idea of the Imitation Game (Turing 1950). In this game, 
a human interrogator converses electronically with another 
human and with a machine, and must decide which is which. 
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If the interrogator cannot reliably tell the difference, Tur- 
ing suggested, then we ought to conclude that the machine 
thinks. 

The Imitation Game, known better today as the Turing 
Test, has been a matter of debate ever since. Turing’s sug- 
gestion had a bold implicit claim: that conversation is a com- 
plex enough behavior that conversational ability is sufficient 
evidence of intelligence. Objections (Searle 1980), as well 
as defenses (Levesque 2009), are numerous. 

In 1991, Hugh Loebner began sponsoring the Loebner 
Prize, an annual contest modeled after the Turing Test. Com- 
puter programs entered into the contest attempt to convince 
human judges that they are also human. The program that 
most often succeeds is awarded the annual prize. 

In the initial 1991 contest, 10 judges had 7-minute conver- 
sations with each of 6 programs and 2 human “confederates” 
and ranked them from least to most human. The judges and 
confederates were not computer scientists, and presumably 
they did not know each other. To make the program devel- 
opers’ task easier, conversations were restricted to specific 
topics. Participants were instructed to converse naturally, 
without “trickery or guile” (Shieber 1994). Although the 
two humans were not ranked first by every judge, their aver- 
age rankings were higher than any program. The programs 
may have been even less successful if the judges had been 
more knowledgeable in computer science, or had been per- 
mitted to conduct an interrogation rather than a polite con- 
versation (Shieber 1994). 

The format of the Loebner contest has varied over the sub- 
sequent decades in response to such observations (Loebner 
1991). Restrictions on conversational topics and styles were 
abandoned after the first few years. Turn-taking, which was 
enforced by early protocols, is lately left up to the partici- 
pants after judges initiate their conversations. In some years, 
messages have been transmitted character-by-character, re- 
quiring programs to simulate live typing. In some years, 
judges have interacted with both a human and a program si- 
multaneously on a split screen. Time limits on conversations 
have ranged from 5 to 25 minutes. 

The results of the Loebner contest have also varied, with- 
out a predictable trend. In 2000, judge decisions were 93% 
correct after 15 minutes in one-on-one conversation (Moor 
2001). In 2008, the winning program convinced 25% of the 
judges to make the wrong choice after 5 minutes on a split 


screen (Shah and Warwick 2009). In 2013, the judges were 
AI experts, and after 25 minutes on a split screen they made 
no mistakes at all (Loebner 1991). 

One interesting aspect of Turing Tests is that humans 
sometimes get labeled as computers. There are many rea- 
sons for these errors: cultural differences, language barri- 
ers, the ambiguity of natural language, and the limitations of 
electronic communication, to name a few. Variation in the 
eagerness and cooperativeness of the confederates may also 
play a role (Christian 2011). 


Turing Test as Advertisement 


Attracting students to study computer science is an ongoing 
endeavor in CS education. There are many challenges in- 
volved, but one of the major factors is unfamiliarity with the 
field. Students often get little exposure to computer science 
before college and may have negative preconceived notions 
about it (Carter 2006). 

Introductory CS courses can address this knowledge gap, 
but first we must get students to take them. Exposure to the 
Turing Test could be one way to pique students’ interest. It 
is a broadly accessible concept, requiring no particular back- 
ground knowledge to understand and discuss, and participa- 
tion requires no technical skill. It tends to be an interesting 
topic for students who have encountered aspects of artificial 
intelligence in both fiction and reality throughout their lives. 

Although nothing beats active participation, merely dis- 
cussing the Turing Test can be an interesting exercise. Tur- 
ing’s original paper (Turing 1950) is quite readable, and it 
can provoke lively discussions on what intelligence is and 
how to detect it. Introducing some controversy, via the fa- 
mous Chinese Room argument (Searle 1980) and its rebut- 
tals (Levesque 2009), can further enhance the debate. 

Although these discussions are largely philosophical, they 
inevitably elicit curiosity about more technical aspects of 
computer science. How would a conversation program 
work? Can a program make a choice? Is it capable of behav- 
ing randomly? These kinds of questions make good reasons 
to point students towards programming courses. 


Turing Test as Science Experiment 


The Turing Test translates a philosophical question into a 
scientific experiment with human participants. For students 
of disciplines like psychology and statistics, it could serve 
as a practical exercise in designing and carrying out such 
an experiment. Students would need to grapple with several 
experimental parameters and types of data. 

Some of the parameters are quantitative. How many 
judges and confederates should there be? How many conver- 
sations should each participant have, and how long should 
they have for each one? Should judges make binary clas- 
sifications of the contestants, or rank them, or assign them 
scores? What should be the criteria for a program to pass the 
test? 

Other important decisions are qualitative. For example, 
what limitations should be placed on conversation in the 
test? Unlike the Loebner Prize, a classroom test will prob- 
ably not have state-of-the-art chatbots that are tailored to 


the event. Some restrictions may be desirable to make the 
judges’ task interesting. 

The results of a Turing Test lend themselves to several 
types of analysis. Judges generate data that students can col- 
lect and examine statistically. A more qualitative analysis of 
the conversation transcripts is also interesting. 


Turing Test as Active Philosophy 


The Turing Test is sometimes discussed in courses on arti- 
ficial intelligence (Russell and Norvig 2009). It also some- 
times appears in philosophy courses as part of a discussion 
on mind and consciousness (Kim 2010). In courses like 
these, there may be value added by going beyond discussing 
the Turing Test and actually conducting one. 

Arguably we need only point to the extensive literature 
on active learning to justify this suggestion. Active partici- 
pation in classroom activities is known to promote engage- 
ment and understanding (Prince 2004). However, we can 
also identify some specific goals that are better achieved in 
a live Turing Test than in a discussion. 

It is difficult for students to judge whether or not the Tur- 
ing Test is an effective device for the evaluation of intelli- 
gence. This is particularly true for students who lack pro- 
gramming experience. After testing modern chatbots them- 
selves, students can begin to develop more informed opin- 
ions. The experience can expose assumptions they may have 
about the abilities and limitations of computers - and also of 
humans! 


Turing Test as Programming Exercise 


The Turing Test has a software prerequisite: conversations 
need to take place over an electronic interface, so that the 
identities of the participants are concealed. The Loebner 
prize at first addressed this need by commissioning software 
from the contestants, but in recent years, the organizers have 
hired a professional developer. 

For experienced computer science students, creating this 
type of software could be a worthwhile programming 
project. The application we implemented for our Turing 
Test, described in the next section, could serve as one as- 
signment model. It touches upon several important aspects 
of the CS curriculum: object-oriented programming, user in- 
terface design, concurrency, and client-server architecture. 

We note that this software requirement is probably the 
main reason that the Turing Test is not a common activity 
in the classroom. Although any CS faculty member would 
be able to implement it, the time investment may discourage 
most from doing so. For this reason, our software is avail- 
able (via email) to any instructor who wishes to use it or 
build from it. 


Our Turing Test 


We conducted a Turing Test with 78 students at our small 
liberal-arts college. The students came from two courses in 
the First-Year Program, whose goal is to develop college- 
level skills in writing, speaking, and research. FYP courses 
are team-taught and have interdisciplinary themes that pro- 
vide context for skill development. 


b telae 


Received: Would you say Mr. Pickwick reminded you of Christmas? 
Sent: In a way. 

Received: Yet Christmas is a winter's day, and 1 do not think Mr Pickwick 
would ming the comparison. 


[don't think you're serious. 


Send 


Figure 1: Our Turing Test chat interface. 


In section 1, taught by Torrey and Johnson, the theme was 
Reason and Debate in Scientific Controversy. Students in 
this course, most of whom intended to pursue science ma- 
jors, discussed controversies from several fields of science. 
Since Torrey is a computer scientist and AI researcher, one 
of the controversies we chose was the Turing-Searle debate. 
The Turing Test event served as a culminating activity for 
that unit. 

In section 2, taught by Sondergard, Ponce, and Desmond, 
the theme was Paranormal Phenomena and Science: Expe- 
riencing and Analyzing the Inexplicable. Students in this 
course examined scientific methodologies that have been 
applied to the study of anomalous phenomena, from the 
placebo effect to dream telepathy. Sondergard and Desmond 
study the early modern scientific history of Europe and Asia, 
and saw in the Turing Test a modern variant on traditional at- 
tempts to verify the presence of human (or other) agency in 
inexplicable phenomena. 


Procedure 


We structured our Turing Test much like the early Loeb- 
ner contests. Judges and contestants conversed one-on-one 
through a custom-built user interface (see Figure 1) that sent 
each message in one piece and enforced turn-taking. After 
5 minutes, the judge had to label the contestant as human or 
computer. The computer role was performed by Cleverbot, a 
web chatbot that learns words and phrases from people who 
interact with it online (Carpenter 2011). 

Each of our students was assigned to be a judge, a con- 
federate, or a transcriber. The judge and confederate roles 
are familiar. The transcriber’s task was to submit the judge’s 
questions to the Cleverbot website and then copy Cleverbot’s 
replies back into the Turing Test interface. 

We divided our 78 participants into 8 test groups. Most 
groups consisted of 5 judges, 3 confederates, and 2 tran- 
scribers; the smaller group was less one judge and one tran- 
scriber. Within a group, each judge spoke to each contestant, 
so most students participated in 5 conversations. 


The entire event took place within two 45-minute ses- 
sions, with 4 of the 8 test groups present for each session. 
We arranged for the judges to be in one computer lab, the 
confederates in another, and the transcribers in a third. This 
required 3 labs with 20, 12, and 8 workstations respectively. 

Naturally, as first-year undergraduates, our students had 
no particular expertise in computer science or artificial in- 
telligence. They knew how the test would proceed, but we 
did not introduce them to Cleverbot before the event. We as- 
signed them IDs that indicated where and when they should 
arrive, but they did not know which role they would play un- 
til they arrived. We encouraged them all to think about po- 
tential questioning strategies in advance, in case they ended 
up as judges. 

The conversation topics were mostly unrestricted, but we 
did establish two rules: 


1. Don’t be rude. 


2. Don’t ask for personal or identifying information, and 
avoid giving such information if asked. 


Rule 1 was just a precaution against the decivilizing ef- 
fects of anonymity, but Rule 2 was important for the validity 
of the test. Its goal was to offset the inherent advantage that 
our students had over Cleverbot because many of them are 
friends or acquaintances. For the same reason, we arranged 
for most conversations to take place between students from 
different course sections. 


Software 


The software we designed for our Turing Test fulfilled sev- 
eral needs. Of course, it allowed students to have anony- 
mous electronic conversations. It also coordinated multiple 
test groups simultaneously so that judges spoke once with 
each contestant in their group, and it filed away all the con- 
versations and judgments for later analysis. 

The application is built on a client-server model. Each 
participant in a session runs a copy of the client, and all the 
clients connect to a single server. Both the client and server 
are implemented in Java and can therefore run on any net- 
worked machine. 

The server program has a bare-bones command-line in- 
terface. It asks for a session ID and the sizes of all the test 
groups in the session, then starts accepting client connec- 
tions. Once all the test groups are filled, conversations can 
be started and stopped through keyboard prompts. 

When conversations are started, the server becomes multi- 
threaded, with one thread controlling each concurrent con- 
versation. Messages are accepted alternately from the judge 
and contestant. When a conversation ends, the server 
prompts the judge’s client for a decision. It saves all conver- 
sation transcripts and judge decisions to local HTML files, 
ready to view in a web browser. 

The client program has a graphical interface. It begins 
with a registration screen at which participants enter their 
assigned IDs. During conversations, it displays the simple 
chat screen in Figure 1. After each conversation, it displays 
one last screen to judges only, requiring them to select a 
radio button to make their decision. 


Test Group | Confederate 1 | Confederate 2 | Confederate 3 | Transcriber 1 | Transcriber 2 

A 2/5 5/5 3/5 3/5 0/5 
B 3/5 4/5 5/5 2/5 2/5 
C 3/5 4/5 3/5 2/5 2/5 
D 2/5 4/5 4/5 3/5 1/5 
E 5/5 5/5 5/5 2/5 1/5 
F 5/5 5/5 3/5 2/5 2/5 
G 2/5 5/5 1/5 3/5 2/5 
H 3/4 3/4 4/4 0/4 - 


Table 1: The fraction of judges within each test group who labeled each contestant as human. 


Although it is simpler than the server with respect to con- 
currency, the client program is also multi-threaded. One 
thread handles communication with the server, and the Event 
Dispatch Thread handles changes to the user interface. 

Implementing this software would pose several kinds of 
challenges for advanced CS majors. In terms of program- 
ming skills, it would be an exercise in client-server archi- 
tecture, object-oriented design, user interface development, 
and concurrency. In terms of software engineering, it could 
be an exercise in team coordination, requirements analysis, 
design, and testing. Since we are making our version of the 
software available to instructors, it could also be used for an 
assignment on modifying an existing codebase. 


Statistical analysis 


Table 1 displays the judgment data from each of our test 
groups. Collectively, our judges had 117 conversations with 
confederates and 74 conversations with Cleverbot. They cor- 
rectly identified 88 of the confederates (75% correct) and 47 
of the chatbots (64% correct). Their combined accuracy was 
71% overall. 

To look at it another way: on average, our students rec- 
ognized each other as human 75% of the time, and thought 
Cleverbot was human 36% of the time. They had a relatively 
high error rate, but overall there was still a clear distinction 
between human and machine. 

This distinction was not so clear in each individual test 
group. The least accurate test group was G, which correctly 
classified only 53% of the confederates and misclassified 
50% of the chatbots. Leaving out test group H because of 
its unusual size and structure, the most accurate test group 
was E, which correctly classified 100% of the confederates 
and misclassified 30% of the chatbots. 

In test groups A-D, the judges were from section 2 and the 
confederates were from section 1. In groups E-H, it was the 
other way around. If we examine them separately, some sta- 
tistical differences emerge. The judges from section 2 were 
70% correct on confederates, 61% correct on Cleverbot, and 
66% correct overall. The judges from section 1 were 79% 
correct on confederates, 65% correct on Cleverbot, and 74% 
correct overall. 

The most likely explanation for this discrepancy is that 
the Turing Test played a larger role in the course for section 
1 than for section 2. Before this event, students in section 
1 read several papers, conducted in-class discussions, and 
participated in an experimental design exercise. This addi- 


tional exposure may have allowed them to be more discern- 
ing judges and/or more neutral confederates. 


Transcript analysis 


Although we have no direct testimony from the judges about 
their interrogation strategies or decision rationales, we can 
make some inferences by analyzing their conversations. The 
transcripts can also provide some insight into student as- 
sumptions about the abilities and limitations of computers. 
All of the excerpts in this section have been reproduced 
without any editing. 

Judges initiated conversations in a variety of ways: by 
giving a conventional greeting, making a remark, or asking 
a question. Typically, they abandoned their opening topic 
after the contestant’s first response. Occasionally, the judge 
would pursue the same topic for another exchange, but only 
4% of the opening topics lasted for 3 exchanges and only 
1% lasted for 4 or more. This tendency to switch gears no 
doubt worked to Cleverbot’s advantage, since it appears to 
base its response only on the most recent prompt. 

Many judges asked about preferences and opinions: 


So what do you like to do? 
What are your thoughts on Obama? 
What’s your favorite Tom Hanks movie? 


Others checked for recognition of popular trends: 


What does #tbt mean? 
Do you like to twerk? 


A few demanded objective knowledge: 


What is the capital of Canada? 
What is synaptic pruning? 


Judges were likely to mistake the chatbot for a human if 
their conversation did not venture beyond small talk. This 
type of dialogue is so formulaic that Cleverbot has no trouble 
giving appropriate responses: 


Judge: Hey there 

Cleverbot: How are you? 

Judge: I am good thanks. how are you? 
Cleverbot: I am good. 


Only a few conversations went beyond pleasantries and 
identification of interests. A few judges and confederates 
did establish more of an emotional or humorous connection: 


Judge: What if you found out your dog died? 
Confederate: my dog is 13, and she could die any day, 
so thanks for that. 

Judge: well... you love your dog dont you? whats the 
breed of your dog? and whats yoru favorite thing about 
it? 

Confederate: She’s a black lab, and I love that she 
doesn’t bark. She was trained not to. 

Judge: That’s no fun how do you know she’s happy? 
Confederate: She has her own ways of telling us, like 
wagging her tail and jumping on her back legs. 


Judge: Meow. 

Confederate: Are you a cat? 

Judge: Yes. 

Confederate: Wow. That’s interesting. I didn’t know 
cats could type. You must be a very smart cat. 

Judge: Thank you, most people don’t know about how 
cats can type. You’re one of the lucky ones. 
Confederate: I feel so honored. 


In a few interesting cases, Cleverbot seemed to fool 
judges by making them feel a connection. These judges 
were even willing to look beyond the occasional stumble: 


Judge: i want to hear your secret first 

Cleverbot: I just got broken up with my girlfriend 
of three years, i'm hurting. What should I do to stop 
dreaming about her 

Judge: i think you need to find a new girl 

Cleverbot: Mine are correct, yours are not 

Judge: what does that even mean? 

Cleverbot: It means that you should create motto for 
us 

Judge: YOLO! that’s the motto. deal? 


Judge: Dobby has no master. Dobby is a free elf. 
Cleverbot: But Dobby got a sock, Dobby is free! 
Judge: How dare you stand where he stood!? 
Cleverbot: Sir? 

Judge: A man who trusted you! 

Cleverbot: You’re not a man. 

Judge: How do you know? 

Cleverbot: I don’t know. 

Judge: Exactly, you don’t. but we are both Harry 
Potter fans and that is okay 


Other judges mistook Cleverbot for a human when it 
made replies that seemed witty, knowing, or creative. This 
type of “cleverness” is just what Cleverbot is named for: 

Judge: What’s up? 

Cleverbot: The ceiling. 

Judge: What’s down? 

Cleverbot: The floor. 


Judge: knock knock 
Cleverbot: No, no jokes. This is a serious matter. 


Judge: What is life? 
Cleverbot: Life is but a second of an eternity that we 
will exist in. 


Judges misclassified their fellow humans as computers 
surprisingly often. Typically, these errors could be attributed 
to a social gap of some kind between judge and confederate. 
For example, some confederates failed to recognize cultural 
references, or responded soberly to a playful judge, or vice 
versa: 


Judge: whats gucci my ****, What’s drugs my 
dealer... 

Confederate: Why would you talk like that? 

Judge: Whay would YOU talk like that? Is there 
something wrong with the way I speak? I am not 
judging you’re “whats up”... 

Confederate: I am sorry it was a simple question, I 
was just curious why you were talking like that 
Judge: It’s song lyrics... I apologize for blacking on 
you. 


Judge: hey girl hey ;) 
Confederate: Greetings. 
assume I’m a girl, isn’t it? 


That’s presumptious to 


Judge: Where are you from? 
Confederate: A galaxy far, far away. 


In general, judges were suspicious of contestants who de- 
clined to pursue a proposed topic or indulged in non se- 
quiturs. This seems to reflect an assumption about chatbot 
weaknesses that is entirely correct. On the other hand, con- 
federates who responded with blandness or formality were 
also more likely to be called computers. This bias may come 
from media depictions of artificial intelligence. 


Conclusions 


We will conclude by observing which aspects of our Turing 
Test worked well, and which could be improved. This expe- 
rience could benefit other instructors who wish to conduct a 
Turing Test, regardless of its pedagogical role. 

First, we noticed that the conversations were rather brief. 
They contained only 6 exchanges on average (ranging from 
3 to 12). Our students agreed that the allotted time for each 
conversation felt short. They may have been more accu- 
rate, and felt more satisfied, with slightly longer conversa- 
tions. We would advise allowing more than 5 minutes, but 
10 would probably be excessive. 

It seems likely that the turn-taking style of our conversa- 
tion interface slowed the students down. They were often 
stuck waiting for their conversation partners to reply. How- 
ever, since Cleverbot never speaks out of turn, we had to 
enforce this style to avoid obvious distinctions between the 
chatbot and the humans. 

The students did not let their waiting time go entirely to 
waste; they had lively exchanges with others in the same 
computer lab. They shared quotes from their conversations 


and indulged in speculation about their partners’ identities. 
Overall, we thought that this enhanced their engagement in 
the activity, although it did highlight the importance of plac- 
ing students who are playing different roles into different 
rooms. 

A related question we considered is whether the software 
should have facilitated simultaneous split-screen conversa- 
tions instead of a sequence of one-on-one conversations. 
Hugh Loebner himself favors the simultaneous style (Loeb- 
ner 2009). However, given the wide range of success our 
students had at recognizing each other as human, we prefer 
the one-on-one style. Results with the simultaneous style 
would seem to confound multiple factors. 

The fact that our test groups had such a wide range of sta- 
tistical results made us glad that we had 8 of them to analyze 
collectively. If we had only been able to conduct one test, 
and it had happened to produce results far from the median 
in either direction, our students might have come away from 
the activity with misconceptions about the status of modern 
AI. Instructors with smaller groups of students could have 
each student participate in more than one test, perhaps with 
a different role each time. 

One practical problem that we encountered was that our 
students belonged to a shared community. There were a few 
exchanges like this: 


Judge: Describe your FYP male professor? 
Confederate: One is tall, wears converse every other 
day. He has a ponytail and a handlebar mustache. The 
other is shorter with glasses and always wears a shirt 
and tie to class with a vest. 


Judge: are you excited for this upcoming break? 
Confederate: very excited cant wait to head home on 
friday and finally eat a home cooked meal. 


Ideally, Turing Test participants would be a diverse set 
of strangers, so that judges cannot exploit common knowl- 
edge to recognize confederates so easily. In the classroom 
context, of course, we must settle for less ideal circum- 
stances. We do think it helped that most conversations in 
our tests were between students from different course sec- 
tions. However, it would probably be best to have an ex- 
plicit rule about concealing community membership as well 
as personal identity. 

We were pleased at the extent to which our students fol- 
lowed the rules we did have. It most likely helped that we 
went over these rules multiple times and took care to explain 
why they were necessary. Preparing students effectively 
seems crucial for a successful Turing Test, and not only for 
the purposes of rule enforcement: judges need to think about 
their strategies in advance, or they may find themselves with 
minds blank at the first empty chat screen. 

We think it would be possible to prepare students too well 
for a Turing Test. The more they know about chatbots and 
their weaknesses in advance, the fewer classification errors 
they are likely to make, and a test with very high accuracy 
would not be as interesting or educational. Encouraging stu- 
dents to think about likely weaknesses and having them test 


their speculations, without confirming or contradicting them 
in advance, seems to us to strike the right balance. 
Cleverbot worked well for us, even though it was not the 
most sophisticated chatbot available, because our partici- 
pants were so inexperienced. For students with more back- 
ground, a stronger chatbot may be more appropriate. 
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Contact 


Please contact ltorrey @stlawu.edu to acquire the Turing Test 
software or for further information. 


